import io
import requests
import numpy as np
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt
from collections import Counter
from datetime import datetime
from pandas_profiling import ProfileReport
%matplotlib inline
# Importing Dataset
## Collect each month's frame in a list, then combine them at the end
frames = []
## Create date range for the months of 2019 to be used in building each URL
months = pd.date_range('1/1/2019', '1/1/2020', freq="M").strftime('%Y%m')
## Iterate over each month, build that month's URL, pull the csv data
## and collect it for concatenation
for m in months:
    URL = "http://www.bom.gov.au/climate/dwo/" + m + "/text/IDCJDW2801." + m + ".csv"
    ## Use the requests module, as pandas' native read_csv throws a 403 error
    ## because it does not send browser-like headers
    r = requests.get(URL).text
    frames.append(pd.read_csv(io.StringIO(r), sep=",", skiprows=6))
df = pd.concat(frames)
## Set the "Date" column to Datetime type
df["Date"] = pd.to_datetime(df["Date"])
## Set the "Date" column to Datetime index
df = df.set_index("Date")
## Overwrite the unused "Unnamed: 0" column with the month abbreviation from the index
df["Unnamed: 0"] = df.index.strftime("%b")
## Rename column to "Month"
df.rename(columns = {"Unnamed: 0": "Month"}, inplace = True)
# Shape of our data: (365, 21), i.e. 365 observations with 21 variables.
# Because the "Date" column was made the index, it is not included
# in the variable count.
print(df.shape)
df
To view the variables that have an inconsistent data type, I use the pandas_profiling module, which produces an easy-to-read HTML report of the data's variables and many useful characteristics of the data. The report can be viewed via the HTML version hosted on my personal GitHub.
profile = ProfileReport(df, title = "2019 Canberra, Australia Weather Data", html={'style':{'full_width':True}})
profile.to_notebook_iframe()
# Drop "Time of maximum wind gust"
df.drop(columns="Time of maximum wind gust", inplace = True)
# Rename variables
df.rename(columns = {"Minimum temperature (°C)":"MinTemp",
"Maximum temperature (°C)":"MaxTemp",
"Rainfall (mm)":"Rainfall",
"Evaporation (mm)":"Evaporation",
"Sunshine (hours)":"Sunshine",
"Direction of maximum wind gust ":"WindGustDir",
"Speed of maximum wind gust (km/h)":"WindGustSpeed",
"9am Temperature (°C)":"Temp9am",
"9am relative humidity (%)":"Humidity9am",
"9am cloud amount (oktas)":"Cloud9am",
"9am wind direction":"WindDir9am",
"9am wind speed (km/h)":"WindSpeed9am",
"9am MSL pressure (hPa)":"Pressure9am",
"3pm Temperature (°C)":"Temp3pm",
"3pm relative humidity (%)":"Humidity3pm",
"3pm cloud amount (oktas)":"Cloud3pm",
"3pm wind direction":"WindDir3pm",
"3pm wind speed (km/h)":"WindSpeed3pm",
"3pm MSL pressure (hPa)":"Pressure3pm"}, inplace = True)
# The column count is one fewer than it would be in R,
# because the "Date" column is being treated as the index
df.shape
df.columns
df
# Replace all instances of "Calm" with numeric "0"
df.replace("Calm",0, inplace = True)
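Replacing the string with a numeric zero is what lets the later pd.to_numeric conversion succeed; a toy illustration with hypothetical wind speeds (not the real column):

```python
import pandas as pd

speed = pd.Series(["13", "Calm", "20"])   # mixed strings force object dtype
speed = speed.replace("Calm", 0)          # now only numeric-looking values remain
speed = pd.to_numeric(speed)
print(speed.tolist())  # [13, 0, 20]
```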
# Current types of variables.
df.dtypes
# Change types of variables to an appropriate type
df[["WindGustSpeed","Cloud9am","Cloud3pm",
"WindSpeed9am","WindSpeed3pm"]] = df[["WindGustSpeed","Cloud9am","Cloud3pm","WindSpeed9am","WindSpeed3pm"]].apply(pd.to_numeric)
df[["Month","WindGustDir","WindDir9am","WindDir3pm"]] = df[["Month","WindGustDir","WindDir9am","WindDir3pm"]].astype("category")
df.dtypes
# Define new binary variable "RainToday" that is 1 if rain was more than 1mm and 0 otherwise
df["RainToday"] = df["Rainfall"].apply(lambda x: 1 if x > 1 else 0)
# Define new binary variable "RainTomorrow" that is the 1-day shifted copy of "RainToday"
df["RainTomorrow"] = df["RainToday"].shift(periods=1, fill_value=0)
# Save a csv of cleaned data which can be read in via R
df.to_csv("project_1.csv")
# Provided example
m_r = pd.crosstab(df["Month"], columns=df["RainTomorrow"])
m_r
# Chi-square test for Month and RainTomorrow
# p-value 0.017 < 0.05, so we reject independence at the 0.05 level
ss.chi2_contingency(m_r)
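ss.chi2_contingency returns a 4-tuple of the test statistic, the p-value, the degrees of freedom, and the table of expected counts. A minimal sketch on a toy 2x2 table (hypothetical counts, not the real crosstab) showing how the quoted p-value is read off:

```python
import numpy as np
import scipy.stats as ss

# Toy 2x2 contingency table with hypothetical counts
table = np.array([[20, 10],
                  [10, 25]])

# Unpack the statistic, p-value, degrees of freedom and expected counts
chi2, p, dof, expected = ss.chi2_contingency(table)
print(dof, p < 0.05)
```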
As an initial investigation into the cleaned dataset, I again create a profile report using the pandas_profiling module:
profile = ProfileReport(df, title = "Project 1 - 2019 Canberra, Australia Weather Data", html={'style':{'full_width':True}})
profile.to_notebook_iframe()
The first observation grew out of a question I had about days with constant temperatures. I wondered whether days with a more constant temperature (which I define as a small difference between the maximum and minimum temperature) depend on the month of the year. In other words, do days with small differences between max and min temperatures depend on the time of year in Canberra, Australia?
First we can look at the average difference between max and min temperatures for each month, as summarized below. The "Date" column is a numeric representation of the month of the year. The floating-point value in the second column is the average max-min temperature difference for the given month.
# Create new variable "MinMaxDiff"
df["MinMaxDiff"] = df["MaxTemp"] - df["MinTemp"]
# Show average "MinMaxDiff" for each month
df.groupby(df.index.month)["MinMaxDiff"].mean()
From this initial observation, it looks like temperature constancy does depend on the month of the year. In other words, the data seem to suggest that during the winter months in Canberra, Australia, the temperature is more constant, i.e. there is a smaller difference between the daily max and min temperatures.
To test this, we can use a one-way ANOVA test. For the ANOVA test to be valid, we must check the assumptions that the population of max-min differences is normally distributed and that the group variances are homogeneous.
# Q-Q Plot for normality check
ss.probplot(df["MinMaxDiff"],plot=plt);
The Q-Q plot above shows that the data are approximately normal. With real-world data it is rare to observe a perfectly normal distribution, so for this analysis I assume that normality holds.
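As a complementary numeric check, the Shapiro-Wilk test can back up the visual Q-Q assessment. A sketch on synthetic data standing in for the max-min differences (with the real data you would pass df["MinMaxDiff"] instead):

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
# Synthetic stand-in for the 365 daily max-min temperature differences
sample = rng.normal(loc=10.0, scale=3.0, size=365)

# W close to 1 and a p-value above 0.05 are consistent with normality
stat, p = ss.shapiro(sample)
print(round(stat, 4), round(p, 4))
```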
# Levene's test for homogeneity of variance
ss.levene(df[df["Month"]=="Jan"]["MinMaxDiff"], df[df["Month"]=="Feb"]["MinMaxDiff"],df[df["Month"]=="Mar"]["MinMaxDiff"],
df[df["Month"]=="Apr"]["MinMaxDiff"],df[df["Month"]=="May"]["MinMaxDiff"],df[df["Month"]=="Jun"]["MinMaxDiff"],
df[df["Month"]=="Jul"]["MinMaxDiff"],df[df["Month"]=="Aug"]["MinMaxDiff"],df[df["Month"]=="Sep"]["MinMaxDiff"],
df[df["Month"]=="Oct"]["MinMaxDiff"],df[df["Month"]=="Nov"]["MinMaxDiff"],df[df["Month"]=="Dec"]["MinMaxDiff"])
To test for homogeneity of variances I used Levene's test, passing in the max-min differences for each month. With a p-value greater than the 0.05 significance level, we cannot reject the null hypothesis and therefore assume homogeneity of variance.
ss.f_oneway(df[df["Month"]=="Jan"]["MinMaxDiff"], df[df["Month"]=="Feb"]["MinMaxDiff"],df[df["Month"]=="Mar"]["MinMaxDiff"],
df[df["Month"]=="Apr"]["MinMaxDiff"],df[df["Month"]=="May"]["MinMaxDiff"],df[df["Month"]=="Jun"]["MinMaxDiff"],
df[df["Month"]=="Jul"]["MinMaxDiff"],df[df["Month"]=="Aug"]["MinMaxDiff"],df[df["Month"]=="Sep"]["MinMaxDiff"],
df[df["Month"]=="Oct"]["MinMaxDiff"],df[df["Month"]=="Nov"]["MinMaxDiff"],df[df["Month"]=="Dec"]["MinMaxDiff"])
Finally, we perform the one-way ANOVA test. The very small p-value, well below the 0.05 significance level, suggests that the max-min temperature difference does indeed depend on the month of the year. If we take a small max-min difference to imply a more constant temperature, then we can argue that the consistency of temperature in Canberra, Australia depends on the month of the year.
The second observation concerns the relationship between humidity and the likelihood of rain tomorrow. There are two parts to this, as our dataset has two humidity measurements, taken at 9am and 3pm. I will first look at humidity measured at 9am.
Also, as humidity in our dataset is measured as a continuous variable, I needed to discretize it to be able to perform the chi-square test. I do so by splitting the humidity measurements into 10 equal-width groups, spanning from the minimum measured humidity level to 100% (the maximum observed/possible humidity).
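pd.cut with an integer number of bins produces equal-width (not equal-count) intervals; a toy illustration with hypothetical humidity values (pd.qcut would instead give equal-count bins):

```python
import pandas as pd

vals = pd.Series([20, 35, 50, 65, 80, 95, 100])  # hypothetical humidity (%)
bins = pd.cut(vals, 4)  # 4 equal-width intervals spanning 20..100
print(bins.value_counts().sort_index())
```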
# Create cross table between "Humidity9am" and "RainTomorrow"
x = pd.crosstab(pd.cut(df["Humidity9am"],10),df["RainTomorrow"])
x
ss.chi2_contingency(x)
For "Humidity9am" you can see that there is very little variation in the liklihood of raining due to the humidity measured at 9am. This is seen by the large p-value of 0.32 which is greater than our significance level 0.05 and suggests that there is little evidence against the null hypothesis. I.e. the variables are independent.
Even though the previous observation showed independence between the two variables, I thought it might be interesting to see whether that result held for humidity measured at 3pm.
I perform the same steps as above for "Humidity3pm".
# Create cross table between "Humidity3pm" and "RainTomorrow"
x = pd.crosstab(pd.cut(df["Humidity3pm"],10),df["RainTomorrow"])
x
ss.chi2_contingency(x)
Perhaps surprisingly, the chi-square test above shows a potential relationship between the likelihood of rain and the humidity measured at 3pm. The p-value of 0.0038 is less than our 0.05 significance level, so we reject the null hypothesis and conclude that rain chances vary with the humidity measured at 3pm.
This is the opposite of the result obtained with humidity measured at 9am, and it may give us some intuition about the overall relationship between humidity and rain chances. In other words, the data suggest that rain chances depend on humidity measurements taken later in the day, which seems intuitive, as those measurements are temporally closer to the new day.